Abstract

Identifying viral pathogens and characterizing their transmission is essential to developing effective
public health measures in response to a pandemic. Phylogenetics, though currently the most popular
tool used to characterize the likely host of a virus, can be ambiguous when studying species very
distant to known species and when there is very little reliable sequence information available in
the early stages of the pandemic. Motivated by an existing framework for representing biological
sequence information, we learn sparse, tree-structured models, built from decision rules based on
subsequences, to predict viral hosts from protein sequence data using popular discriminative machine
learning tools. Furthermore, the predictive motifs robustly selected by the learning algorithm are
found to show strong host-specificity and occur in highly conserved regions of the viral proteome.

1. INTRODUCTION

Emerging pathogens constitute a continuous threat to our society, as it is notoriously difficult to perform a realistic
assessment of optimal public health measures when little information on the pathogen is available. Recent outbreaks
include the West Nile virus in New York (1999); SARS coronavirus in Hong Kong (2002-2003); LUJO virus in Lusaka
(2008); H1N1 influenza pandemic virus in Mexico and the US (2009); and cholera in Haiti (2010). In all these cases,
an outbreak of unusual clinical diagnoses triggered a rapid response, and an essential part of this response is the
accurate identification and characterization of the pathogen.

Sequencing is becoming the most common and reliable technique to identify novel organisms. For instance, LUJO 
was identified as a novel, very distinct virus after the sequence of its genome was compared to other arenaviruses [1].
The genome of an organism is a unique fingerprint that reveals many of its properties and past history. For instance, 
arenaviruses are zoonotic agents usually transmitted from rodents. 

Another promising area of research is metagenomics, in which DNA and RNA samples from different environments
are sequenced using shotgun approaches. Metagenomics is providing an unbiased understanding of the different species
that inhabit a particular niche. Examples include the human microbiome and virome, and the Ocean metagenomics
collection. It has been estimated that there are more than 600 bacterial species living in the mouth but that only
20% have been characterized.

Pathogen identification and metagenomic analysis point to an extremely rich diversity of unknown species, where
partial genomic sequence is the only information available. The main aim of this work is to develop approaches
that can help infer characteristics of an organism from subsequences of its genomic sequence where primary sequence
information analysis does not allow us to identify its origin. In particular, our work will focus on predicting the host
of a virus from the viral genome.

The most common approach to deduce a likely host of a virus from the viral genome is sequence / phylogenetic
similarity (i.e., the most likely host of a particular virus is the one that is infected by related viral species). However,
similarity measures based on genomic / protein sequence or protein structure could be misleading when dealing with
species very distant to known, annotated species. Other approaches are based on the fact that viruses undergo
mutational and evolutionary pressures from the host. For instance, viruses could adapt their codon bias for a more
efficient interaction with the host translational machinery, or they could be under pressure from deaminating enzymes
(e.g., APOBEC3G in HIV infection). All these factors imprint characteristic signatures in the viral genome. Several
techniques have been developed to extract these patterns (e.g., nucleotide and dinucleotide compositional biases, and
frequency analysis techniques). Although most of these techniques could reveal an underlying biological mechanism,
they lack sufficient accuracy to provide reliable assessments. An approach related to the one presented here
is DNA barcoding. Genetic barcoding identifies conserved genomic structures that contain the necessary information
for classification.

Using contemporary machine learning techniques, we present an approach to predicting the hosts of unseen viruses,
based on the amino acid sequences of proteins of viruses whose hosts are well known. Using sequence and host
information of known viruses, we learn a multi-class classifier composed of simple sequence-motif based questions
(e.g., does the viral sequence contain the motif 'DALMWLPD'?) that achieves high prediction accuracies on held-out
data. Prediction accuracy of the classifier is measured by the area under the ROC curve, and is compared to
a straightforward nearest-neighbour classifier. Importantly (and quite surprisingly), a post-processing study of the
highly predictive sequence-motifs selected by the algorithm identifies strongly conserved regions of the viral genome,
facilitating biological interpretation.

2. METHODS 

Our overall aim is to discover aspects of the relationship between a virus and its host. Our approach is to develop 
a model that is able to predict the host of a virus given its sequence; those features of the sequence that prove most 
useful are then assumed to have a special biological significance. Hence, an ideal model is one that is parsimonious and 
easy to interpret, whilst incorporating combinations of biologically relevant features. In addition, the interpretability 
of the results is improved if we have a simple learning algorithm which can be straightforwardly verified. 

Formally, for a given virus family, we learn a function $g : \mathcal{S} \to \mathcal{H}$, where $\mathcal{S}$ is the space of viral sequences and $\mathcal{H}$
is the space of viral hosts. The space of viral sequences $\mathcal{S}$ is generated by an alphabet $\mathcal{A}$, where $|\mathcal{A}| = 4$ (genome
sequence) or $|\mathcal{A}| = 20$ (primary protein sequence).

Defining a function on a sequence requires representation of the sequence in some feature space. Below, we specify
a representation $\phi : \mathcal{S} \to \mathcal{X}$, where a sequence $s \in \mathcal{S}$ is mapped to a vector of counts of subsequences $\mathbf{x} \in \mathcal{X} \subset \mathbb{N}_0^D$.
Given this representation, we have the well-posed problem of finding a function $f : \mathcal{X} \to \mathcal{Y}$ built from a space of
simple binary-valued functions.

A. Collected Data 

The collected data consist of $N$ genome sequences or primary protein sequences, denoted $s_1 \ldots s_N$, of viruses whose
host class, denoted $h_1 \ldots h_N$, is known. For example, these could be 'plant', 'vertebrate' and 'invertebrate'. The label
for each virus is represented numerically as $\mathbf{y} \in \mathcal{Y} = \{0, 1\}^L$, where $[\mathbf{y}]_l = 1$ if the index of the host class of the virus
is $l$, and where $L$ denotes the number of host classes. Note that this representation allows for a virus to have multiple
host classes. Here and below we use boldface variables to indicate vectors and square brackets to denote the selection
of a specific element in the vector, e.g., $[\mathbf{y}_n]_l$ is the $l^{\text{th}}$ element of the $n^{\text{th}}$ label vector.

B. Mismatch Feature Space 

A possible feature space representation of a viral sequence is the vector of counts of exact matches of all possible
$k$-length subsequences ($k$-mers). However, due to the high mutation rate of viral genomes, a predictive function
learned using this simple representation of counts would fail to generalize well to new viruses. Instead, motivated by
an existing framework for representing biological sequences,
we count not just the presence of an individual $k$-mer but also the presence of subsequences within $m$ mismatches from
that $k$-mer. The mismatch- or $m$-neighborhood of a $k$-mer $\alpha$, denoted $\mathcal{N}_\alpha^m$, is the set of all $k$-mers with a Hamming
distance [3] at most $m$ from it, as shown in Table I. Let $\delta_{\mathcal{N}_\alpha^m}$ denote the indicator function of the $m$-neighborhood of
$\alpha$ such that

$$\delta_{\mathcal{N}_\alpha^m}(\beta) = \begin{cases} 1 & \text{if } \beta \in \mathcal{N}_\alpha^m \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$



TABLE I. Mismatch feature space representation. The mismatch feature space representation of a segment of a protein
sequence . . . AQGPRIYDDTCQHPS WWMNFE VR GSP . . .

We can then define, for any possible $k$-mer $\beta$, the mapping $\phi$ from the sequence $s$ onto a count of the elements in
$\beta$'s $m$-neighborhood as

$$\phi_{k,m}(s, \beta) = \sum_{\alpha \in s,\ |\alpha| = k} \delta_{\mathcal{N}_\beta^m}(\alpha). \tag{2}$$

Finally, the $d^{\text{th}}$ element of the feature vector for a given sequence is then defined elementwise as

$$[\mathbf{x}]_d = \phi_{k,m}(s, \beta_d) \tag{3}$$

for every possible $k$-mer $\beta_d \in \mathcal{A}^k$, where $d = 1 \ldots D$ and $D = |\mathcal{A}|^k$.

Note that when $m = 0$, $\phi_{k,0}$ exactly captures the simple count representation described earlier. This biologically
realistic relaxation allows us to learn discriminative functions that better capture rapidly mutating and yet functionally
conserved regions in the viral genome, facilitating generalization to new viruses.
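
The mismatch count defined above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `hamming`, `mismatch_count` and `mismatch_features` are hypothetical, and the sketch scores only a supplied list of candidate k-mers rather than every possible k-mer over the alphabet, purely to keep the example small.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def mismatch_count(s: str, beta: str, m: int) -> int:
    """Count the k-length windows of s lying within Hamming distance m of beta."""
    k = len(beta)
    return sum(hamming(s[i:i + k], beta) <= m for i in range(len(s) - k + 1))


def mismatch_features(s: str, kmers: List[str], m: int) -> List[int]:
    """Feature vector of mismatch counts for a (hypothetical) list of candidate k-mers."""
    return [mismatch_count(s, beta, m) for beta in kmers]
```

With `m = 0` the count reduces to exact k-mer matching; raising `m` lets a single feature absorb point substitutions, at the cost of a less specific motif.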



C. Alternating Decision Trees 



Given this representation of the data, we aim to learn a discriminative function that maps features $\mathbf{x}$ onto host
class labels $\mathbf{y}$, given some training data $\{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$. We want the discriminative function to output a
measure of "confidence" [5] in addition to a predicted host class label. To this end, we learn on a class of functions
$\mathbf{f} : \mathcal{X} \to \mathbb{R}^L$, where the indices of positive elements of $\mathbf{f}(\mathbf{x})$ can be interpreted as the predicted labels to be assigned
to $\mathbf{x}$ and the magnitudes of these elements as the confidence in the predictions.

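A simple class of such real-valued discriminative functions can be constructed from the linear combination of simple
binary-valued functions $\psi : \mathcal{X} \to \{0, 1\}$. The functions $\psi$ can, in general, be a combination of single-feature decision
rules or their negations:

$$\mathbf{f}(\mathbf{x}) = \sum_{p=1}^{P} \mathbf{a}_p \psi_p(\mathbf{x}) \tag{4}$$

$$\psi_p(\mathbf{x}) = \prod_{d \in S_p} \mathbb{I}(x_d \ge \theta_d) \tag{5}$$

where $\mathbf{a}_p \in \mathbb{R}^L$, $P$ is the number of binary-valued functions, $\mathbb{I}(\cdot)$ is 1 if its argument is true, and zero otherwise,
$\theta_d \in \{0, 1, \ldots, \Theta\}$, where $\Theta = \max_{d,n} [\mathbf{x}_n]_d$, and $S_p$ is a subset of feature indices. This formulation allows functions to
be constructed using combinations of simple rules. For example, we could define a function $\psi$ as the following

$$\psi(\mathbf{x}) = \mathbb{I}(x_5 \ge 2) \times \neg\mathbb{I}(x_{d'} \ge 1) \times \mathbb{I}(x_{d''} \ge 4) \tag{6}$$

where $\neg\mathbb{I}(\cdot) = 1 - \mathbb{I}(\cdot)$, and $d'$ and $d''$ are two further feature indices.

Alternatively, we can view each function $\psi_p$ to be parameterized by a vector of thresholds $\boldsymbol{\theta}_p \in \{0, 1, \ldots, \Theta\}^D$,
where $[\boldsymbol{\theta}_p]_d = 0$ indicates $\psi_p$ is not a function of the $d^{\text{th}}$ feature $[\mathbf{x}]_d$. In addition, we can decompose the
weights $\mathbf{a}_p = \alpha_p \mathbf{v}_p$ into a vote vector $\mathbf{v} \in \{+1, -1\}^L$ and a scalar weight $\alpha \in \mathbb{R}_+$. The discriminative model, then,
can be written as

$$\mathbf{f}(\mathbf{x}) = \sum_{p=1}^{P} \alpha_p \mathbf{v}_p \psi_{\boldsymbol{\theta}_p}(\mathbf{x}), \tag{7}$$

$$\psi(\mathbf{x}; \boldsymbol{\theta}_p) = \prod_{d=1}^{D} \mathbb{I}(x_d \ge [\boldsymbol{\theta}_p]_d). \tag{8}$$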

Every function in this class of models can be concisely represented as an Alternating Decision Tree (ADT).
Similar to ordinary decision trees, ADTs have two kinds of nodes: decision nodes and output nodes. Every decision
node is associated with a single-feature decision rule, the attributes of the node being the relevant feature and
corresponding threshold. Each decision node is connected to two output nodes corresponding to the associated decision
rule and its negation. Thus, binary-valued functions in the model come in pairs $(\psi, \tilde\psi)$; each pair is associated with
the pair of output nodes for a given decision node in the tree (see Figure 1). Note that $\psi$ and $\tilde\psi$ share the same
threshold vector $\boldsymbol{\theta}$ and only differ in whether they contain the associated decision rule or its negation. The attributes
of the output node pair are the vote vectors $(\mathbf{v}, \tilde{\mathbf{v}})$ and the scalar weights $(\alpha, \tilde\alpha)$ associated with the corresponding
functions $(\psi, \tilde\psi)$.

Each function $\psi$ has a one-to-one correspondence with a path from the root node to its associated output node
in the tree; the single-feature decision rules in $\psi$ being the same as those rules associated with decision nodes in
the path, with negations applied appropriately. Combinatorial features can, thus, be incorporated into the model by
allowing for trees of depth greater than 1. Including a new function $\psi$ in the model is, then, equivalent to either
adding a new path of decision and output nodes at the root node in the tree or growing an existing path at one of
the existing output nodes. This tree-structured representation of the model will play an important role in specifying
how Adaboost, the learning algorithm, greedily searches over an exponentially large space of binary-valued functions.
It is important to note that, unlike ordinary decision trees, each example runs down an ADT through every path
originating from the root node.
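
To make the "every path" evaluation concrete, the following Python sketch scores a feature vector with a small ADT stored as a flat list of paths. It is a minimal sketch under stated assumptions rather than the implementation used here: the `Path` class, its field names, and the functions `psi` and `adt_score` are hypothetical, and each single-feature rule is assumed to test "count at least threshold", optionally negated.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Path:
    """One root-to-output-node path: a conjunction of rules plus its output attributes."""
    rules: List[Tuple[int, int, bool]]  # (feature index, threshold, negated?)
    alpha: float                        # non-negative scalar weight
    votes: List[int]                    # one +1/-1 vote per host class


def psi(path: Path, x: Sequence[int]) -> int:
    """Binary-valued function of the path: 1 only if every (possibly negated) rule holds."""
    return int(all((x[d] >= th) != neg for d, th, neg in path.rules))


def adt_score(paths: List[Path], x: Sequence[int], n_classes: int) -> List[float]:
    """Sum alpha * vote over all satisfied paths; sign gives the label, magnitude the confidence."""
    scores = [0.0] * n_classes
    for p in paths:
        if psi(p, x):
            for l in range(n_classes):
                scores[l] += p.alpha * p.votes[l]
    return scores
```

Unlike an ordinary decision tree, no path is exclusive here: an example contributes to the score through every path whose rules it satisfies, which is why the loop visits all paths.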



D. Multi-class Adaboost 



Having specified a representation for the data and the model, we now briefly describe Adaboost, a large-margin
supervised learning algorithm which we use to learn an ADT given a data set. Ideally, a supervised learning algorithm
learns a discriminative function $\mathbf{f}^*(\mathbf{x})$ that minimizes the number of mistakes on the training data, known as the
Hamming loss:

$$\mathbf{f}^*(\mathbf{x}) = \arg\min_{\mathbf{f}} \mathcal{L}_h(\mathbf{f}) = \sum_{\substack{1 \le n \le N \\ 1 \le l \le L}} \mathbb{I}\left(H([\mathbf{f}(\mathbf{x}_n)]_l) \ne [\mathbf{y}_n]_l\right) \tag{9}$$

where $H(\cdot)$ denotes the Heaviside function. The Hamming loss, however, is discontinuous and non-convex, making
optimization intractable for large-scale problems.

FIG. 1: Alternating Decision Tree. An example of an ADT where rectangles are decision nodes, circles are output nodes
and, in each decision node, $[\beta] = \phi_{k,m}(s, \beta)$ is the feature associated with the $k$-mer $\beta$ in sequence $s$. The output nodes
connected to each decision node are associated with a pair of binary-valued functions $(\psi, \tilde\psi)$. The binary-valued function
corresponding to the highlighted path is given as $\psi(\mathbf{x}; \boldsymbol{\theta}_1) = \mathbb{I}([\text{AKNELSID}] \ge 2) \times \neg\mathbb{I}([\text{AAALASTM}] \ge 1)$ and the associated
$\alpha = 0.3$.

Adaboost is the unconstrained minimization of the exponential loss, a smooth, convex upper-bound to the Hamming
loss, using a coordinate descent algorithm:

$$\mathbf{f}^*(\mathbf{x}) = \arg\min_{\mathbf{f}} \mathcal{L}_e(\mathbf{f}) = \sum_{\substack{1 \le n \le N \\ 1 \le l \le L}} \exp\left(-[\mathbf{y}_n]_l [\mathbf{f}(\mathbf{x}_n)]_l\right). \tag{10}$$

Adaboost learns a discriminative function $\mathbf{f}(\mathbf{x})$ by iteratively selecting the $\psi$ that maximally decreases the exponential
loss. Since each $\psi$ is parameterized by a $D$-dimensional vector of thresholds $\boldsymbol{\theta}$, the space of functions $\psi$ is of size
$O((\Theta + 1)^D)$, where $\Theta$ is the largest $k$-mer count observed in the data, making an exhaustive search at each iteration
intractable for high-dimensional problems.

To avoid this problem, at each iteration, we only allow the ADT to grow by adding one decision node to one of
the existing output nodes. To formalize this, let us define $Z(\boldsymbol{\theta}) = \{d : [\boldsymbol{\theta}]_d \ne 0\}$ to be the set of active features
corresponding to a function $\psi$. At the $t^{\text{th}}$ iteration of boosting, the search space of possible threshold vectors is then
given as $\{\boldsymbol{\theta} : \exists \tau < t,\ Z(\boldsymbol{\theta}) \supset Z(\boldsymbol{\theta}_\tau),\ |Z(\boldsymbol{\theta})| - |Z(\boldsymbol{\theta}_\tau)| = 1\}$. In this case, the search space of thresholds at the $t^{\text{th}}$
iteration is of size $O(t \Theta D)$ and grows linearly in a greedy fashion at each iteration (see Figure 1). Note, however, that
this greedy search, enforced to make the algorithm tractable, is not relevant when the class of models is constrained
to belong to ADTs of depth 1.
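
The restricted search space can be enumerated directly, which also shows why its size grows linearly with the number of existing paths. The sketch below is illustrative only: the function name `candidate_thresholds` and the representation of a threshold vector as a sparse dictionary (feature index to non-zero threshold, with the empty dictionary standing for the root) are assumptions made for this example.

```python
from typing import Dict, Iterator, List


def candidate_thresholds(
    existing: List[Dict[int, int]], n_features: int, max_count: int
) -> Iterator[Dict[int, int]]:
    """Yield every threshold vector that extends an existing path by exactly one active feature."""
    for theta in existing:
        for d in range(n_features):
            if d in theta:
                continue  # feature already active on this path
            for threshold in range(1, max_count + 1):
                extended = dict(theta)
                extended[d] = threshold
                yield extended
```

With `existing` holding the root plus the paths added so far, the number of candidates is at most the number of existing paths times `n_features` times `max_count`, in line with the linear growth described above.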

In order to pick the best function $\psi$, we need to compute the decrease in exponential loss admitted by each function
in the search space, given the model at the current iteration. Formally, given the model at the $t^{\text{th}}$ iteration, denoted
$\mathbf{f}^t(\mathbf{x})$, the exponential loss upon inclusion of a new decision node, and hence the creation of two new paths $(\psi_\theta, \tilde\psi_\theta)$,
into the model can be written as

$$\mathcal{L}_e(\mathbf{f}^{t+1}) = \sum_{n=1}^{N} \sum_{l=1}^{L} \exp\left(-[\mathbf{y}_n]_l \left[\mathbf{f}^t(\mathbf{x}_n) + \alpha \mathbf{v} \psi_\theta(\mathbf{x}_n) + \tilde\alpha \tilde{\mathbf{v}} \tilde\psi_\theta(\mathbf{x}_n)\right]_l\right) \tag{11}$$

$$= \sum_{n=1}^{N} \sum_{l=1}^{L} w^t_{nl} \exp\left(-[\mathbf{y}_n]_l \left[\alpha \mathbf{v} \psi_\theta(\mathbf{x}_n) + \tilde\alpha \tilde{\mathbf{v}} \tilde\psi_\theta(\mathbf{x}_n)\right]_l\right) \tag{12}$$

where $w^t_{nl} = \exp\left(-[\mathbf{y}_n]_l [\mathbf{f}^t(\mathbf{x}_n)]_l\right)$. Here $w^t_{nl}$ is interpreted as the weight on each sample, for each label, at boosting
round $t$. If, at boosting round $t - 1$, the model disagrees with the true label $l$ for sample $n$, then $w^t_{nl}$ is large. If the
model agrees with the label then the weight is small. This ensures that the boosting algorithm chooses a decision rule
at round $t$, preferentially discriminating those examples with a large weight, as this will lead to the largest reduction
in $\mathcal{L}_e$.

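For every possible new decision node that can be introduced to the tree, Adaboost finds the $(\alpha, \mathbf{v})$ pair that
minimizes the exponential loss on the training data. These optima can be derived as

$$[\mathbf{v}]_l = \begin{cases} +1 & \text{if } \sum_{n} w^t_{nl} [\mathbf{y}_n]_l \psi_\theta(\mathbf{x}_n) \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{13}$$

$$\alpha = \frac{1}{2} \ln \frac{W_+}{W_-} \tag{14}$$

where, for each new path $\psi_\theta$ associated with each new decision node,

$$W_\pm = \sum_{n,l \,:\, [\mathbf{v}]_l \psi_\theta(\mathbf{x}_n) [\mathbf{y}_n]_l = \pm 1} w^t_{nl}. \tag{15}$$

Corresponding equations for the $(\tilde\alpha, \tilde{\mathbf{v}})$ pair can be written in terms of $\tilde W_+$ and $\tilde W_-$, obtained by replacing $\psi_\theta$ with
$\tilde\psi_\theta$ in the equations above. The minimum loss function for the threshold $\boldsymbol{\theta}$ is then given as

$$\mathcal{L}_e(\mathbf{f}^{t+1}) = 2\sqrt{W_+ W_-} + 2\sqrt{\tilde W_+ \tilde W_-} + W_0 \tag{17}$$

where $W_0 = \sum_{n,l \,:\, \psi_\theta(\mathbf{x}_n) = \tilde\psi_\theta(\mathbf{x}_n) = 0} w^t_{nl}$. Based on these model update equations, each iteration of the Adaboost algorithm
involves building the set of possible binary-valued functions to search over, selecting the one for which the loss function
given by Eq. 17 is minimized, and computing the associated $(\alpha, \mathbf{v})$ pair using Eq. 13 and Eq. 14.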
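
The update for a single candidate decision node can be sketched in a few lines of Python. This is an illustration under explicit assumptions, not the implementation used in this work: the names `path_update` and `candidate_loss` are hypothetical, labels are assumed to be coded as +1/-1 inside the boosting update, each vote is taken as the sign of the weighted label sum over the examples that satisfy the path, and a small constant `eps` is added inside the logarithm to guard against division by zero.

```python
import math
from typing import List, Sequence, Tuple


def path_update(
    w: Sequence[Sequence[float]],
    y: Sequence[Sequence[int]],
    psi: Sequence[int],
    eps: float = 1e-12,
) -> Tuple[float, List[int], float, float]:
    """Return (alpha, votes, W_plus, W_minus) for one path with indicator values psi[n]."""
    n_samples, n_labels = len(w), len(w[0])
    # Vote for each label: sign of the weighted label sum over examples on this path.
    votes = [
        1 if sum(w[n][l] * y[n][l] * psi[n] for n in range(n_samples)) >= 0 else -1
        for l in range(n_labels)
    ]
    w_plus = sum(w[n][l] for n in range(n_samples) for l in range(n_labels)
                 if psi[n] and votes[l] * y[n][l] == 1)
    w_minus = sum(w[n][l] for n in range(n_samples) for l in range(n_labels)
                  if psi[n] and votes[l] * y[n][l] == -1)
    alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))  # eps is an assumed smoothing term
    return alpha, votes, w_plus, w_minus


def candidate_loss(w, y, psi, psi_neg) -> float:
    """Minimum exponential loss after adding the decision node with paths psi and psi_neg."""
    _, _, wp, wm = path_update(w, y, psi)
    _, _, wpt, wmt = path_update(w, y, psi_neg)
    w_zero = sum(sum(w[n]) for n in range(len(w)) if not psi[n] and not psi_neg[n])
    return 2 * math.sqrt(wp * wm) + 2 * math.sqrt(wpt * wmt) + w_zero
```

At each boosting round one would evaluate `candidate_loss` for every candidate decision node and keep the minimizer; a rule that separates the heavily weighted examples cleanly drives one of the two products under the square roots toward zero.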

3. RESULTS 
A. Data specifications 

We aim to learn a predictive model to identify hosts of viruses belonging to a specific family; we show results
for Picornaviridae and Rhabdoviridae. Picornaviridae is a family of viruses that contain a single-stranded, positive-sense
RNA. The viral genome usually contains about 1-2 Open Reading Frames (ORFs), each coding for protein
sequences about 2000-3000 amino acids long. Rhabdoviridae is a family of negative-sense, single-stranded RNA viruses
whose genomes typically code for five different proteins: large protein (L), nucleoprotein (N), phosphoprotein (P),
glycoprotein (G), and matrix protein (M). The data consist of 148 viruses in the Picornaviridae family and 50 viruses
in the Rhabdoviridae family. For some choice of $k$ and $m$, we represent each virus as a vector of counts of all
possible $k$-mers, up to $m$ mismatches, generated from the amino-acid alphabet. Each virus is also assigned a label
depending on its host: vertebrate / invertebrate / plant in the case of Picornaviridae, and animal / plant in the case
of Rhabdoviridae (see Table SI for the names and label assignments of viruses). Using multiclass Adaboost, we learn
an ADT classifier on training data drawn from the set of labeled viruses and test the model on the held-out viruses.

B. BLAST Classifier accuracy

Given whole protein sequences, a straightforward classifier is given by a nearest-neighbour approach based on the
Basic Local Alignment Search Tool (BLAST) [11]. We can use the BLAST score (or P-value) as a measure of the distance
between the unknown virus and a set of viruses with known hosts. The nearest-neighbour approach to classification
then assigns the host of the closest virus to the unknown virus. Intuitively, as this approach uses the whole protein
to perform the classification, we expect the accuracy to be very high. This is indeed the case: BLAST, along with
a 1-nearest-neighbour classifier, successfully classifies all viruses in the Rhabdoviridae family, and all but 3 viruses in
the Picornaviridae family. What is missing from this approach, however, is the ability to ascertain and interpret
host-relevant motifs.



C. ADT Classifier accuracy 

The accuracy of the ADT model, at each round of boosting, is evaluated using a multi-class extension of the Area
Under the Curve (AUC). Here the 'curve' is the Receiver Operating Characteristic (ROC), which traces a measure of
the classification accuracy of the ADT for each value of a real-valued discrimination threshold. As this threshold is
varied, a virus is considered a true (or false) positive if the prediction of the ADT model for the true class of that
protein is greater (or less) than the threshold value. The ROC curve is then traced out in True Positive Rate - False
Positive Rate space by changing the threshold value, and the AUC score is defined as the area under this ROC curve.
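
As an illustration of the score, the area under the ROC curve for one host class equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as one half), which gives a compact way to compute it. The sketch below is not the evaluation code used here: the names `auc` and `macro_auc` are hypothetical, and averaging the per-class AUC over host classes is only one possible multi-class extension, assumed for the example.

```python
from typing import List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve for one class via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positive and negative examples")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_rows: List[Sequence[float]], label_rows: List[Sequence[int]]) -> float:
    """Unweighted mean of per-class AUC (an assumed multi-class extension)."""
    n_classes = len(score_rows[0])
    per_class = [
        auc([row[l] for row in score_rows], [row[l] for row in label_rows])
        for l in range(n_classes)
    ]
    return sum(per_class) / n_classes
```

The pairwise form is quadratic in the number of examples, which is acceptable for data sets of a few hundred viruses; a rank-based computation would be preferable for much larger sets.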

The ADT is trained using 10-fold cross validation, calculating the AUC, at each round of boosting, for each fold
using the held-out data. The mean AUC and standard deviation over all folds are plotted against boosting round in
Figures 2 and 3. Note that the 'smoothing effect' introduced by using the mismatch feature space allows for improved
prediction accuracy for $m > 0$. For Picornaviridae, the best accuracy is achieved at $m = 5$, for a choice of $k = 12$;
this degree of 'smoothing' is optimal for the algorithm to capture predictive amino-acid subsequences present, up
to a certain mismatch, in rapidly mutating viral protein sequences. For Rhabdoviridae, near perfect accuracy is
achieved with merely one decision rule, i.e., Rhabdoviridae with plant or animal hosts can be distinguished based on
the presence or absence of one highly conserved region in the L protein.

FIG. 2: Prediction accuracy for Picornaviridae. A plot of (a) mean AUC vs boosting round, and (b) 95% confidence
interval vs boosting round. The mean and standard deviation were computed over 10 folds of held-out data, for Picornaviridae,
where $k = 12$.

FIG. 3: Prediction accuracy for Rhabdoviridae. A plot of (a) mean AUC vs boosting round, and (b) 95% confidence
interval vs boosting round. The mean and standard deviation were computed over 5 folds of held-out data, for Rhabdoviridae,
where $k = 10$. The relatively higher uncertainty for this virus family was likely due to very small sample sizes. Note that the
cyan curve lies on top of the red curve.



D. Predictive subsequences are conserved within hosts 



Having learned a highly predictive model, we would like to locate where the selected $k$-mers occur in the viral
proteomes. We visualize the $k$-mer subsequences selected in a specific ADT by indicating elements of the mismatch
neighborhood of each selected subsequence on the virus protein sequences. In Figure 4, the virus proteomes are
grouped vertically by their label with their lengths scaled to [0, 1]. Quite surprisingly, the predictive $k$-mers occur in
regions that are strongly conserved among viruses sharing a specific host. Note that the representation we used for
viral sequences retained no information regarding the location of each $k$-mer on the virus protein. Furthermore, these
selected $k$-mers are significant as they are robustly selected by Adaboost for different choices of train / test split of
the data, as shown in Figure 5.



4. DISCUSSION 



We have presented a supervised learning algorithm that learns a model to classify viruses according to their host
and identifies a set of highly discriminative oligopeptide motifs. As expected, the $k$-mers selected in the ADT for
Picornaviridae (Figures 4 and 5) and Rhabdoviridae (Figures S.1 and S.2) are mostly selected in areas corresponding to the
replicase motifs of the polymerase, one of the most conserved parts of the viral genome. Thus, given that partial
genomic sequence is normally the only information available, we could achieve quicker bioinformatic characterization
by focusing on the selection and amplification of these highly predictive regions of the genome, instead of full genomic
characterization and contig assembly. Moreover, in contrast with the generic approaches currently in use, such a targeted
amplification approach might also speed up the process of sample preparation and improve the sensitivity for viral
discovery.

FIG. 4: Visualizing predictive subsequences. A visualization of the mismatch neighborhood of the first 7 $k$-mers selected
in an ADT for Picornaviridae, where $k = 12$, $m = 5$. The virus proteomes are grouped vertically by their label with their
lengths scaled to [0, 1]. Regions containing elements of the mismatch neighborhood of each $k$-mer are then indicated on the
virus proteome. Note that the proteomes are not aligned along the selected $k$-mers but merely stacked vertically with their
lengths normalized.

Over-representation of highly similar viruses within the data used for learning is an important source of overfitting
that should be kept in mind when using this technique. Specifically, if the data largely consist of nearly identical viral
sequences (e.g., different sequence reads from the same virus), the learned ADT model would overfit to insignificant
variations within the data (even if 10-fold cross validation were employed), making generalization to new subfamilies of
viruses extremely poor. To check for this, we hold out viruses corresponding to a particular subfamily (see Table SI for
subfamily annotation of the viruses used), run 10-fold cross validation on the remaining data and compute the expected
fraction of misclassified viruses in the held-out subfamily, averaged over the learned ADT models. For Picornaviridae,
viruses belonging to the subfamilies Parechovirus (0.47), Tremovirus (0.8), Sequivirus (0.5), and Cripavirus (1.0) were
poorly classified, with misclassification rates indicated in parentheses. Note that the Picornaviridae data used consist
mostly of Cripaviruses; thus, the high misclassification rate could also be attributed to a significantly lower sample
size used in learning. For Rhabdoviridae, viruses belonging to Novirhabdovirus (0.75) and Cytorhabdovirus (0.77) were
poorly classified. The poorly classified subfamilies, however, contain a very small number of viruses, suggesting that the
method generalizes well on average.
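
The subfamily hold-out check described above amounts to a grouped split followed by an error-rate computation on the held-out group. The following Python sketch shows only that bookkeeping; it is an assumption-laden illustration, with hypothetical function names `leave_one_group_out` and `misclassification_rate`, and with predictions supplied by whatever classifier was trained on the remaining viruses.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_group_out(groups: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """For each subfamily, yield (subfamily, indices to train on, held-out indices)."""
    for g in sorted(set(groups)):
        held_out = [i for i, name in enumerate(groups) if name == g]
        remaining = [i for i, name in enumerate(groups) if name != g]
        yield g, remaining, held_out


def misclassification_rate(predicted: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of held-out viruses whose predicted host class differs from the true one."""
    if len(predicted) != len(true) or not true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p != t for p, t in zip(predicted, true)) / len(true)
```

Averaging `misclassification_rate` over the models learned in each cross-validation fold of the remaining data would give a per-subfamily figure comparable to the rates quoted above.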

Other applications for this technique include identification of novel pathogens using genomic data, analysis of the
most informative fingerprints that determine host specificity, and classification of metagenomic data using genomic
information. For example, an alternative application of our approach would be the automatic discovery of multi-locus
barcoding genes. Multi-locus barcoding is the use of a set of genes which are discriminative between species, in order
to identify known specimens and to flag possible new species [12]. While we have focused on virus host in this work,
ADTs could be applied straightforwardly to the barcoding problem, replacing the host label with a species label.
Additional constraints on the loss function would have to be introduced to capture the desire for suitable flanking
sites of each selected $k$-mer in order to develop the universal PCR primers important for a wide application of the
discovered barcode [13].

FIG. 5: Visualizing predictive regions of protein sequences. A visualization of the mismatch neighborhood of the first
7 $k$-mers, selected in all ADTs over 10-fold cross validation, for Picornaviridae, where $k = 12$, $m = 5$. Regions containing
elements of the mismatch neighborhood of each selected $k$-mer are indicated on the virus proteome, with the grayscale intensity
on the plot being inversely proportional to the number of cross-validation folds in which some $k$-mer in that region was selected
by Adaboost. Thus, darker spots indicate that some $k$-mer in that part of the proteome was robustly selected by Adaboost.
Furthermore, a vertical cluster of dark spots indicates that the region, selected by Adaboost to be predictive, is also strongly
conserved among viruses sharing a common host type.

Supplementary Figures

FIG. S.1: Visualizing predictive subsequences for Rhabdoviridae. A visualization of the mismatch neighborhood of the
$k$-mer selected in an ADT for Rhabdoviridae, where $k = 10$, $m = 2$. The virus proteomes are grouped vertically by their label
with their lengths scaled to [0, 1]. Regions containing elements of the mismatch neighborhood of each $k$-mer are then indicated
on the virus proteome.

FIG. S.2: Visualizing predictive regions for Rhabdoviridae. A visualization of the mismatch neighborhood of the $k$-mers
selected in an ADT for Rhabdoviridae, where $k = 10$, $m = 2$. The virus proteomes are grouped vertically by their label with
their lengths scaled to [0, 1]. Regions containing elements of the mismatch neighborhood of each $k$-mer are then indicated on
the virus proteome.



